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Abstract: In a Gaussian graphical model, the conditional independence between 
two variables are characterized by the corresponding zero entries in the inverse co- 
variance matrix. Maximum likelihood method using the smoothly clipped absolute 
deviation (SCAD) penalty (Fan and Li, 2001) and the adaptive LASSO penalty 
(Zou, 2006) have been proposed in literature. In this article, we establish the result 
that using Bayesian information criterion (BIC) to select the tuning parameter in 
penalized likelihood estimation with both types of penalties can lead to consistent 
graphical model selection. We compare the empirical performance of BIC with 
cross validation method and demonstrate the advantageous performance of BIC 
criterion for tuning parameter selection through simulation studies. 
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model; Model selection; Oracle property; Penalized likelihood 

1. Introduction 

A multivariate Gaussian graphical model is also known as a covariance se- 
lection model. The conditional independence relationships between the random 
variables are equivalent to the specified zeros among the inverse covariance ma- 
trix. More exactly, let X = (iW,..,!^) be a p-dimensional random vector 
following a multivariate normal distribution N p (/j,, E) with /i denoting the un- 
known mean and E denoting the nonsingular covariance matrix. Denote the 
inverse covariance matrix as E _1 = C = (Cij)i<jj< p . The zero entries Cj,- in the 
inverse covariance matrix indicate the conditional independence between the two 
random variables X^' and X^ given all other variables (Dempster, 1972, Whit- 
taker, 1990, Lauritzen, 1996). The Gaussian random vector X can be represented 
by an undirected graph G = (V,E), where V contains p vertices corresponding 
to the p coordinates and the edges E = (eij)i<i<j<p represent the conditional 
dependency relationships between variables X® and X^\ It is of interest to 
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identify the correct set of edges, and estimate the parameters in the inverse co- 
variance matrix simultaneously. 

To address this problem, many methods have been developed up to date. 
In general, there would be no zero entries in the maximum likelihood esti- 
mate, which results in a full graphical structure. Dempster (1972) and Edwards 
(2000) proposed to use the penalized likelihood method with the Lo-type penalty 
P\(\cij\)i^j = XI(\cij\ / 0), where /(.) is the indicator function. Since the Lq 
penalty is discontinuous, the resulting penalized likelihood estimator is unstable. 
Another standard approach to perform model selection in Gaussian graphical 
model is stepwise forward selection or backward elimination of the edges. How- 
ever, this approach ignores the stochastic errors inherited in the multiple stages 
of the procedure (Edwards, 2000) and causes the statistical properties of the 
method hard to comprehend. Furthermore, the computational complexity of 
this greedy search algorithm increases exponentially with the number of vertices 
in the graph. Meinshausen and Biihlmann (2006) proposed a computationally 
attractive method for covariance selection. The proposed method performs the 
neighborhood selection for each node and combines the results to learn the over- 
all graphical structure. It has been shown that this method is connected to the 
quadratic approximation of the loglikelihood with L\ penalty (Yuan and Lin, 
2007). Nevertheless this method performs the model selection and parameter es- 
timation separately. Yuan and Lin (2007) proposed penalized likelihood methods 
for estimating the concentration matrix with L\ penalty (LASSO) (Tibshirani, 
1996). The method can be implemented through the maxdet algorithm in con- 
vex optimization. However, due to the inherent computational complexity, the 
maxdet algorithm can only handle matrices with small p. 

Banerjee, Ghaoui and D'aspremont (2007) have proposed a block-wise updat- 
ing algorithm for the estimation of inverse covariance matrix. For each block- wise 
update, the problem is a box-constrained quadratic program, which can be solved 
by an interior-point procedure. They further showed that the problem emerges 
from each step of block-wise update is equivalent to a linear regression under L\ 
penalty. Further in this line, Friedman, Hastie and Tibshirani (2008) proposed 
the graphical LASSO algorithm to estimate the sparse inverse covariance matrix 
using the LASSO penalty through coordinate-wise updating scheme. It is the 
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fastest and most convenient algorithm to tackle this problem up to date. Fan, 
Feng and Wu (2009) proposed to estimate the inverse covariance matrix using 
adaptive LASSO and Smoothly Clipped Absolute Deviation (SCAD) penalty to 
attenuate the bias problem. They employed local linear approximation method 
(Zou and Li, 2008) to approximate the LASSO penalty as weighted L\ penalty 
and the method is implemented through the graphical LASSO algorithm. The 
resulted methods with both SCAD and adaptive LASSO penalties are compu- 
tationally convenient algorithms leading to asymptotically unbiased, sparse esti- 
mators which possess oracle property. 

In practice, the performance of the penalized likelihood estimator depends 
on the proper choice of the regularization parameter. In this article, we focus on 
the tuning parameter selection in penalized likelihood estimation of the sparse in- 
verse covariance matrix. Wang, Li and Tsai (2007) proposed to use the Bayesian 
information criterion (BIC) to select the tuning parameter for penalized likeli- 
hood method with SCAD penalty, They showed that BIC with SCAD penalty 
is able to identify the true model consistently in the setting of linear regression 
and partial linear model. Yuan and Lin (2007) used BIC to select the tuning pa- 
rameter with the L\ penalty in the estimation of inverse covariance matrix. But 
the consistency of BIC for Gaussian graphic model has not been investigated. In 
this article, we establish the consistency result of the BIC criterion with both 
SCAD and adaptive LASSO. We show that if SCAD or adaptive LASSO penalty 
is used, the optimum tuning parameter selected by BIC will yield the graphi- 
cal structure identical to the true underlying graphical model with probability 
tending to one as n — ► oo. We also compare the performance of BIC with cross- 
validation method through extensive simulation studies. We demonstrate that in 
small sample size scenario, including the cases when the number of parameters 
greatly exceeds the sample size, BIC exhibits comparable performance as the 
computationally more intensive cross-validation method. However, when sample 
size increases, BIC consistently outperforms cross validation. 

The rest of the article is organized as follows. In Section 2.1 we formulate 
the penalized likelihood function for inverse covariance matrix. In sections 2.2 
and 2.3, we discuss the selection of tuning parameters through the BIC criterion 
and prove its consistency in graphical model selection with SCAD and adaptive 
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LASSO penalty. In section 3, simulation studies are presented to demonstrate the 
empirical performance of the tuning parameter selection with BIC compared with 
the cross validation method in small sample size and large sample size scenarios. 

2.1 Penalized Likelihood Estimation of Inverse Covariance Matrix 

Given a random sample X\, ...,X n following a multivariate normal distribu- 
tion N p (n, E), the loglikelihood for fi and C = can be expressed as 

i=i 

up to a constant not depending on the parameters. The maximum likelihood 
estimator of (//, S) is (X,A), where 

1 n 

A = -Y J {X l -X){X l -X)'. 

Assume that the observations are properly centered, then the sample mean is 
zero. As jl does not depend on C, we have jl = 0. To obtain the maximum 
likelihood estimator of the concentration matrix is equivalent to minimize 

-^i(C) = -\og\C\ +tr(CA). 

To achieve sparse graph structure, penalized likelihood methods have been pro- 
posed in literature and the resulting estimator C should minimize the following 
objective function: 

Q(C) = -log|C| + tr(CA) + J> A (M), (2.1.1) 

with p\ being some penalty function. Yuan and Lin (2007) have proposed to 
use LASSO penalty, p\(\cij\) = X\cij\. Friedman, Hastie and Tibshirani (2008) 
proposed the graphical LASSO algorithm by using a coordinate descent proce- 
dure, which is computationally very fast and guarantees the positive definiteness 
of the resulting estimate. As the LASSO penalty increases linearly with the size 
of its argument, it leads to biases for the estimates of nonzero coefficients. To 
attenuate such estimation biases, Fan and Li (2001) proposed SCAD penalty. 
The penalty function satisfies p\(0) = 0, and its first-order derivative is 

p' x (9) = A{/(0 < A) + ^Z3± i(p > A)}, for 6 > 0, 
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where a is some constant usually set to 3.7 (Fan and Li, 2001), and = tl(t > 
0) is the hinge loss function. 

The SCAD penalty is a quadratic spline function with knots at A and aX. It 
is singular at the origin which ensures the sparsity and continuity of the solution. 
The penalty function does not penalize as heavily as the L\ penalty function on 
large parameters. More important advantage of the SCAD penalty is that the 
method not only selects the correct set of edges, but also produces parameter es- 
timators as efficient as if we know the true underlying graphic structure. Namely, 
the estimators have the so called oracle property. 

Zou (2006) proposed the adaptive LASSO penalty, which imposes a weight 
for each parameter and can be regarded as a weighted version of the LASSO 
penalty. In the current setting, the adaptive LASSO penalty takes the form 
of PA(l c ij|) = ^ w ij\ c ij\t with Wij = l/|cjj| 7 , for some consistent estimator C = 
(cij)i<i,j<p and some 7 > 0. As the empirical performance of the results does not 
differ much for different 7, we follow the conventional choice of 7 = 0.5. 

Both SCAD and adaptive LASSO can be efficiently implemented using the 
graphical LASSO algorithm. For SCAD penalty, Fan, Feng and Wu (2009) pro- 
posed to use local linear approximation (Zou and Li, 2008) to approximate the 
SCAD by a symmetric linear function. The proposed iterative re-weighted pe- 
nalized likelihood method optimizes the objective function at step (k + 1) as 
follows: 

Q (C7 )(fc+i) = -\og\C\+tr(CA) + J2wij\ci j \, (2.1.2) 

with Wij = p'\(\clj^\), and denoting the estimates obtained at previous step. 
The computation can be implemented by reiteratively using the graphical LASSO 
algorithm. 

2.2. Consistency of BIC with SCAD 

In literature, two approaches have been used for the selection of tuning pa- 
rameters under the penalized likelihood framework, including the BIC criterion 
(Yuan and Lin, 2007) and cross validation (Friedman, Hastie and Tibshirani, 
2008; Fan, Feng and Wu, 2009). The theoretical investigation of this paper will 
be focused on the consistency result regarding the model selection using BIC cri- 
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terion under the penalized likelihood framework with SCAD or adaptive LASSO 
penalty. 

For the tuning parameter A, it is desirable to have a data-driven method 
to make the selection automatically. Define the full graphical model Gp with 
the full edge set Ep = (eij)i<i<j< p . Define an arbitrary graphical model G with 
the corresponding edge set E C Ep. Define a true model Gp, with the edge 
set Ep = {eij)(i,j):cij -£o,i<ji where c^o denotes the null value of the parameter. 
Define an over-fitted model G if the corresponding edge set E D Ep and E ^ Ep. 
Define an under-fitted model G with the edge set E Ep. 

In practice, as A is unknown, we search for the optimal A from the bounded 
interval f2 = [0, A max ], for some upper limit A max . We further assume that the 
upper limit A max — > 0, as n — > oo. This implies that the search region shrinks to 
as n tends to infinity. Similar assumption can be found in Wang, Li and Tsai 
(2007). Given a tuning parameter A, the penalized likelihood approach yields 
the estimated parameters (cij,\)i<i<j<p- The resulting model is denoted as G\ 
with the edge set E\ = (e^) (ijy.c i:j A ^o- We define = {A G : E\ ~£ Ep}, 
n = {A G n : E x = Ep}, and fi+ = {A G : E x D Ep and E x / Ep}. The three 
subsets of fio, ^+ lead to the true, under and over-fitted models, respectively. 
Given a A, the associated BIC criterion is defined as: 

BIC x = -]0g\C x \+tT(C x A) + l ^& /(%,a/0). 

l<i<j<p 

On the other hand, suppose we know the correct model Gp beforehand and 
perform the maximum likelihood estimation. Under Gp, the parameters can be 
partitioned into two sets: = {cij : 7^ 0}, and = {cij : Cij = 0}. The 
resulted maximum likelihood estimator is denoted as Cq t = (Cq^,0), with 
known to be 0. The associated BIC criterion is denoted as 

BIC GT = -\og\C GT \+tT{C GT A) + ^^- Hcijfi^O). 

l<i<j<p 

In this subsection, we will focus on the discussion on SCAD penalty. We first 
construct a working sequence of reference tuning parameters X n = log(n)/y / n, 
which satisfies the requirement that as X n — > 0, y/n\ n — ► 00. Under such working 
sequence of tuning parameters, according to Theorem 5.2 in Fan, Feng and Wu 
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(2009), with probability tending to one, the resulted method will not only identify 
the correct set of true edges but also yield root-n consistent estimators for all the 
nonzero partial correlation coefficients. This guarantees the following result: 

Lemma 2.2.1. For SCAD penalty, Pr(BIC Xn = BIC Gt ) -> 1 as n -> oo. 

Proof. According to Theorem 5.2 in Fan, Feng and Wu (2009), under the refer- 
ence sequence of tuning parameters, we have lim n ^oo P(dij^\ n = (Hjfi) = 1- It 

follows that limn^oo P(J2i<j I (cij,\ n / °) = Ei<jA c y,o / °)) = L Due to tne 
oracle property, the proposed SCAD penalized likelihood approach estimates the 
parameter under the correct sub- model with probability tending to 1, namely, 
lim^oo P{C\ n = Cg t ) = 1- Then the result of the Lemma follows. 

□ 

Next we will consider the under-fitted model, which is essentially a misspec- 
ified model with at least one of the nonzero parameters being mistakenly set to 
zero. Given A G let denote (cij)(ij)e.E A) and let denote ((Hj)(i,j)fE x - 
The penalized likelihood C\ = (C^,0) is the local minimizer of 

Q x (C) = --£(C^,0)+ PaCM). (2- 2 - 1 ) 

(i,j)eE x 

Under the misspecified graphical model, the parameter space is denoted as C* = 
{C\cij = 0, for E\&ndcij / 0, for G E\}, which does not include 

the true value Co- According to the asymptotic theory for maximum likelihood 
estimation under misspecified model (White, 1982), the maximum likelihood 
estimates C will converge to C* almost surely where C* is the unique parameter 
in the under-fitted model which minimizes the Kullback-Leibler distance to the 
true model, namely 

C* = argmin CgC ^{log/(A;C7 ) - log f(X,C)}. (2.2.2) 

For Gaussian graphical model, the C* is uniquely defined as the — log/(X; C) is 
strictly convex and so is the — E {log f(X;C)}. We further partition the pseudo 
null value C* = (C< a \C*^),withC*^ = (C*-)^)^, and C*^ = (C*-)(ij)^ A 
= 0. It remains to show that the penalized likelihood estimator C\ is also a root-n 
consistent estimator to such pseudo null value C*. 
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Lemma 2.2.2. Given A £ f2_, let C* be defined as in Equation \2.2.2\) . Let the 

objective function Q\(C) be defined as in equation \2.2. with the penalty being 
the SCAD function. If X — > 0, as n — > oo, then there exists a local minimizer C\ 
ofQ\{C), such that C x -C* = O p (n - 3). 

Proof. Consider a constant matrix u with its vectorized form denoted as u. As- 
sume u £ C* and ||u|| = M. Let 1(C) = n/2(log|C| - tr(Cl)). For n large 
enough, we have 

? / 

n{Q x (C* + ^)-Q x (C*)) 

> _ 21{C * + ^-) + 2£(C*) +n Y, + ^ I) - Pa(I^I)} 

(i,i):<fV0 

v 1 m (2-2.3) 

> _ 2f (C*)4= + -n T r(C*Hl + Op(l))+ 
'n n 



-,,2 



n 



£ {p ' A (| c * |)^I + ^(| C * |)^(1 + (1)) } , 



with 



and 



_ ai(c(°>,o) , 
_ g 2 i(^,o) , 

* l ° j ~ Q2(j{a) l(C»(«),0)' 

It is known that for n large enough, and c*j ^ 0, p'\(\c*j\) = and p'\(\c*j\) = 

0. Furthermore, C* satisfies E{ 8 log J ff a f ,0) }|( C '(°),o) = °- This entails ^(C*) = 
O p (-y/n). By standard asymptotic theory, £"(C*) = O p (n). By choosing M large 
enough, the sign of Equation (|2.2.3j) is completely determined by the second term 
of its last line. This implies, for any given e > 0, by choosing a ball centered 
around C*, with radius M sufficiently large, we have 

VL 

P{ inf Q X (C* + —)>Q x (C*)}>l-e. (2.2.4) 
\\u \\=M Jn 



This guarantees that the local minimizer C\ is root-ra consistent for C* . □ 



The result above is helpful to understand the asymptotic property of BIC\ 
under an under-fitted model. Concerning an over-fitted model, some zero-valued 
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parameters are included in the model to be estimated, which can be regarded as 
nuisance parameters. Under an over-fitted model, the parameter space contains 
the correct null value of the parameters. Thus the property of the resulting 
BIC\ can be derived under the standard likelihood theory under correct model 
assumption. 

Lemma 2.2.3. If A max — * 0, and A max > log(n)/^/n as n — * oo, and the penalty 
is SCAD function, then Pr(inf Ag n_un + BIC\ > BIC\ n ) — ► 1. 

Proof. First we consider A £ fi_. According to Lemma 12.2.21 C\ is root-n con- 
sistent to C*. Furthermore, £'(C*) =O p {^/n), and £"(C*) = O p {n). We have 

£(C X ) =£{C*) + |lMCA - C*) + \(C X - C*) T ^\c*(Cx ~ C*)(l + o p (l)) 
=£(C*)+O p (l). 

(2.2.5) 

By similar argument, 1{Cg t ) = £{Cq)+O p {\). By the weak Law of large numbers, 

U{C*)^E{\ogf{X-C*)); 
n 

-£(C )^E(logf(X;C )). 
n 

Furthermore, E{\og f(X; C*)) < E(log f(X; C )) due to the Kullback-Leibler 
inequality. Thus, 

£(C ) -£(C*) =n[E(\ogf(X;C )) - EQog f(X; C*))] + o p (n). (2.2.6) 
This entails 

n(BIC x - BIC Gt ) =2£(C Gt ) - 2£(C X ) + logn(e A - e T ) ^ ^ 

=2£(C ) - 2£(C*) + logn(e A - e T ) + O p (l) > 0, 

where e A = J2i<j ¥= 0), and e T = Yji<j I ( c ij,o ¥= 0). 

Next consider A G Define the maximum likelihood estimator under the 
true model and under the over-fitted model as Cg t , and C\. Note that C\ is 
different from C A , as the former is the maximum likelihood estimate under the 
submodel E\, where the latter is the penalized likelihood estimate under the full 
model using A as the tuning parameter. According to the standard asymptotic 



10 



XIN GAO, DANIEL Q. PU, YUEHUA WU AND HONG XU 



theory for the loglikelihood ratio statistic, we have 2(£(C\) — £(Cq t )) ~ Xe x -e T = 
O p (l). Furthermore, the penalized likelihood estimators under the over-fitted 
model are denoted as C\. From Theorem 5.2 in Fan, Feng and Wu (2009), \C\ — 
C\\ = O p (n _ 2 ). This entails £{C\) = £(C\)+O p {\). Combining the results above, 
we have 

n(BIC x - BIC Gt ) = - 2£(C X ) + 2£{C Gt ) + logn(e A - e T ) 

= - 2£(C X ) + 2£{C Gt ) + logn(e A - e T ) + O p (l) (2.2.8) 
= logn(e A -e T )+O p (l) > 0. 

This completes the proof. 

□ 

This Lemma implies that the As that fail to identify the true model yield 
BIC always larger than \ n . Consequently, the A value which minimizes the BIC 
criterion will identify the true model. Combining the two lemmas above, we es- 
tablishes the consistency of the BIC criterion used under the penalized likelihood 
framework with the SCAD penalty. 

Theorem 2.2.4. 7/A max — > 0, and A max > \og(n)/^fn as n — > oo, then Pr(G^ Bic 
= Gt) 1, where Xbic is the tuning parameter that minimizes the BIC criterion 
with the SCAD penalty. 

2.3. Consistency of BIC with adaptive LASSO 

In this section, we focus on the establishment of the consistency result 
of BIC with adaptive LASSO penalty. Given any a n -consistent estimate C, 
namely, a n {C — Co) = O p (l), the weights of the adaptive LASSO are specified by 
Wij = l/\Cij\' y , for some 7 > 0. We first construct a sequence of reference tuning 
parameters which satisfies the requirement that as X n — > 0, \/n\ n = O p (l), and 
ii2 \ n aZ — > 00. Under such working sequence of tuning parameters, according to 
Theorem 5.3 in Fan, Feng and Wu (2009), with probability tending to one, the 
resulting method will not only identify the correct set of true edges but also yield 
root-n consistent estimators for all the nonzero partial correlation coefficients. 
This guarantees the following result: 

Lemma 2.3.5. Pr(BIC Xn = BIC Gt ) -> 1 as n -> 00. 
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Next we will consider the under-fitted model in a similar manner as what we 
have derived for SCAD penalty. 

Lemma 2.3.6. Given A G the corresponding misspecified model is denoted 
as G\. let C* be defined as in Equation h2.2.2\) . Let the objective function Q\{C) 
defined as in equation h2.1.1\) with adaptive LASSO penalty. If X n — ► 0, \/nX n = 
O p (l), and n^X n an — > oo. then there exists a local minimizer C\ of Q\{C), such 
that C x -C* 



O p (n 2). 



Proof. Consider a constant matrix u with its vectorized form denoted as u. As- 
sume u G C* and ||n|| = M. Let £(C) = n/2(log|C| - tv(CA}). For n large 
enough, we have 



n 

Jn 



> _ 2£(C* + ^=) + 2i(C*) +n Y, + % - Pa(|^|)} 

(i,i):clfM0 

> _ 2Z'(C*)4= + i«'(C*)«(l + op(l)) + nX n V {|c%r 7 ^sign(4)}. 

(M'K (a Vo 

(2.3.1) 

Using similar arguments as in Lemma 12.2.21 we have £'(C*) = O p {y/n), and 
t"{C*) = O p (n). Furthermore, Ic*^" 7 = O p (l), consistent estimator 

of c|- 0. Because -^/nA^ = O p (l), the third term is also O p (l). By choosing 
M large enough, the sign of Equation (I2.2,3j) is completely determined by the 
second term of its last line. This implies, for any given e > 0, by choosing a ball 
centered around C*, with radius M sufficiently large, 

77 

P{ inf Q A (C* + ^)>Q A (C*)}>l-e. (2.3.2) 
||u||=M yn 

This guarantees that the local minimizer C\ is root-ra consistent for C*. □ 

In light of the result above, we are able to study the asymptotic property of 
BIC\ under the under-fitted model and the over-fitted model. Let the working 
sequence of X n be defined as above. Following the same argument as in Section 
3.1, we have 
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Lemma 2.3.7. If A max — »■ 0, and A max > A n , as n — »■ oo, then Pr(mi X en-un + 
BIC X > BICxJ - 1. 

Theorem 2.3.8. If A max — > 0, and A max > A n , as n — > oo, i/jen Pr(G^ Bj = 
Gt) — ► L where Xbic is the tuning parameter that minimizes the BIC criterion 
with adaptive LASSO penalty. 

3. Simulation Studies 

Next we conduct simulation studies to investigate the performance of BIC 
in penalized likelihood estimation of Gaussian graphical model. The main focus 
is to use empirical evidence to support the consistency result of BIC with SCAD 
penalty or Adaptive LASSO. We also compare its performance with cross valida- 
tion, which is another commonly used tuning parameter selection method. The 
if- fold cross-validation method partitions all the samples into K disjoint subsets 
and denote the indices of subjects in fc-fold by T k , k = 1,...,K. The if -fold 
cross-validation score is defined as: 

K 

CV(A) = J> fe (-log|CV fc | + tr(C A) _*40), 
k=l 

where n k is the size of the subset C\-k is the estimated concentration matrix 
based on the sample Uj^fcTj, and A k is the sample covariance matrix calculated 
on subset T k . The optimum tuning parameter A is selected to minimize CV. In 
our simulation, K is set to be 5. 

We simulate three different graphical model structures. 

• Model 1. An AR(1) model is considered with cu = 1, and Cj ; j_i = (H-i^ = 
0.5. 

• Model 2. An AR(2) model is considered with cu = 1.5, Cj^-i = Cj_i ; j = 0.5 
and Cij-2 = Ci- 2 ,i = 0.40. 

• Model 3. A general sparse graphical model is considered. We employed the 
data generating scheme of Li and Gui (2006). To be more specific, we gener- 
ate p nodes randomly on the unit and square and obtained their pairwise Eu- 
clidean distance. For each point, it is connected with an edge to the points 



TUNING PARAMETER SELECTION FOR GAUSSIAN GRAPHICAL MODEL 



13 



with the 3 smallest distances. For each edge, the corresponding entry in the 
inverse covariance matrix is generated uniformly over [—1,-0.5] U [0.5,1]. 
In order to ensure the positive definiteness of the inverse covariance matrix, 
the magnitude of the ith diagonal entry is set as twice of the sum of the 
absolute values of all the off-diagonal entries on the ith row. 

For each model, we use penalized likelihood methods with SCAD, adaptive 
LASSO and LASSO penalties. The tuning parameter for all three penalties are 
selected through either the BIC criterion or the cross-validation criterion. To 
assess the model selection performance, we evaluate the sensitivity, specificity, 
and Matthews correlation coefficient (MCC) which are defined as follows: 

TN TP 
specificity = — — — — , sensitivity = 



MCC 



TN + FP"™™ J ~ TP + FN' 
LP x TN - FP x FN 



y/(TP + FP)(TP + FN)(TN + FP)(TN + FN) ' 
where TP, TN, FP, FN are the numbers of true positives, true negatives, false 
positives, and false negatives. Taking both true and false positives and negatives 
into account, MCC has been widely used to measure the quality of binary clas- 
sifiers. The larger the MCC is, the better the classifier performs. Means and 
standard deviations of the above measures are provided in Tables 1-3. Under 
each of the three models, we generated 100 simulated data sets with different 
combinations of p and n. We considered three scenarios: p = 35, n = 100, and 
p = 75, n = 100. and p = 35, n = 10000. Specifically, when p = 35, and n = 100, 
the total number of parameters in the inverse covariance matrix to be estimated 
is 630, which is larger than the sample size n = 100. When p = 75, and n = 100, 
the corresponding number of parameters is 2850, which greatly exceeds n. Those 
settings can be useful to reveal the empirical performance of the different com- 
peting methods when the number of parameters is greater than the sample size. 
The settings with p = 35, and n = 10000 are used to assess the consistency of 
different methods in model selection when sample size tends to infinity. 

The implementation is based on the GLASSO package in R (Friedman, 
Hastie and Tibshirani, 2008) and we apply the reiterative weighted LASSO (Fan, 
Feng and Wu, 2009) to to obtain the estimates for SCAD method. For adap- 
tive LASSO, we use the sample covariance as the initial estimate and obtain the 



14 



XIN GAO, DANIEL Q. PU, YUEHUA WU AND HONG XU 



weights based on the sample covariance with the power 7 set to 0.5. 

We examine the empirical performance of the three different penalty func- 
tions under the selection of optimal tuning parameter via BIC or cross-validation. 
Tables 1, 2 and 3 provide the average number of specificity, sensitivity and 
Matthew's correlation coefficient over 100 simulated data sets. Standard er- 
rors are provided in the parenthesis. For the three cases of different sample and 
matrix sizes, across all three different graphical structures, the adaptive LASSO 
consistently yields better performance than the LASSO penalty. SCAD also out- 
performs the LASSO penalty except for a few cases with the sample size n = 100. 
When sample size increases, the advantages of adaptive LASSO and SCAD are 
more pronounced. Very interesting to note that when n = 10000, the SCAD and 
adaptive LASSO with BIC can yield sensitivity and specificity close to almost 1. 
For AR(1), the average specificity and sensitivity for SCAD is 0.981 and 1.000, 
and for adaptive LASSO are 0.965 and 1.000. For AR(2), the specificity and 
sensitivity for SCAD are 1.000 and 1.000, and for adaptive LASSO are 0.984, 
and 1.000. For sparse graph with three edges per node, the specificity and sen- 
sitivity for SCAD are 0.999 and 1.000, and for adaptive LASSO are 0.992, and 
1.000. These results confirm with the theoretical results that when n tends to 
infinity, penalized likelihood estimation with SCAD or adaptive LASSO is con- 
sistent and selects the true graphical model with probability tending to one. In 
comparison, the average specificity and sensitivity that the penalized likelihood 
estimation with LASSO under BIC selection are much lower. For instance, for 
AR(1) model, the sensitivity is only about 0.721; for AR(2) model, the sensitiv- 
ity is only about 0.806. These results demonstrate that LASSO with BIC is not 
consistent in model selection across different underlying graphical structures. 

We also compare the performance between BIC and 5-fold cross validation. 
When sample size n = 100, there is no complete dominance of one tuning method 
over the other. For instance, under the sparse graph with three edges per node, 
the relative performance of BIC versus cross validation depends on the penalty 
function. When p = 35 and n = 100, the overall MCC of BIC is higher than cross 
validation for penalty of LASSO but lower than cross validation for penalties of 
SCAD and ADAP. When n = 10000, BIC is more advantageous as it consistently 
yields higher specificity, sensitivity, and MCC than cross validation for all the 
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penalties and across all the graphical models in the simulation study. Overall, 
the BIC method exhibits comparable performance as cross validation in the small 
sample size scenario, but it outperforms cross validation when sample size gets 
large. Computationally, BIC is more convenient to use as cross validation is K 
times more intensive to compute. 

4. Conclusion 

In this article, we investigate the tuning parameter selection for penalized 
likelihood estimation of the inverse covariance matrix in the Gaussian graphical 
model. We establish the consistency of the BIC criterion to select the true 
graphical model with the SCAD or adaptive LASSO penalty. Such consistency 
result of BIC can be extended to the general penalized likelihood estimation 
problems with these two penalties in other models satisfying mild regularity 
conditions. 
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Table 4.1: Results for AR(1) Graphical Model . Averages and standard errors from 100 runs 



p 


n 


Tuning 




LASSO 






SCAD 






ADAP 










SPEC 


SENS 


MCC 


SPEC 


SENS 


MCC 


SPEC 


SENS 


MCC 


35 


100 


BIC 


0.695 


1.000 


0.402 


0.710 


1.000 


0.413 


0.849 


1.000 


0.568 








(0.032) 


(0.000) 


(0.025) 


(0.020) 


(0.000) 


(0.017) 


(0.021) 


(0.000) 


(0.030) 






cv 


0.620 


1.000 


0.348 


0.705 


1.000 


0.410 


0.824 


1.000 


0.533 








(0.025) 


(0.000) 


(0.016) 


(0.016) 


(0.000) 


(0.013) 


(0.016) 


(0.000) 


(0.021) 


75 


100 


BIC 


0.791 


1.000 


0.362 


0.739 


1.000 


0.318 


0.867 


0.998 


0.453 








(0.018) 


(0.000) 


(0.017) 


(0.015) 


(0.000) 


(0.011) 


(0.011) 


(0.005) 


(0.016) 






CV 


0.712 


1.000 


0.299 


0.749 


1.000 


0.325 


0.901 


0.997 


0.515 








(0.017) 


(0.000) 


(0.012) 


(0.006) 


(0.000) 


(0.005) 


(0.007) 


(0.005) 


(0.015) 


35 


10000 


BIC 


0.721 


1.000 


0.424 


0.981 


1.000 


0.902 


0.965 


1.000 


0.839 








(0.025) 


(0.000) 


(0.022) 


(0.006) 


(0.000) 


(0.028) 


(0.008) 


(0.000) 


(0.029) 






CV 


0.521 


1.000 


0.290 


0.976 


1.000 


0.880 


0.917 


1.000 


0.697 








(0.030) 


(0.000) 


(0.016) 


(0.007) 


(0.000) 


(0.031) 


(0.017) 


(0.000) 


(0.040) 



X 

z 
o 
> 
p 

> 
3 

f 

£> 

-a 

a 

< 
x 

> 



3 
> 

a 

x 
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SCAD:the SCAD penalty; LASSO: the L x penalty; ADAP: the adaptive LASSO penalty 



Table 4.2: Results for AR(2) Graphical Model . Averages and standard errors from 100 runs 



p 


n 


Tuning 




LASSO 






SCAD 






ADAP 










SPEC 


SENS 


MCC 


SPEC 


SENS 


MCC 


SPEC 


SENS 


MCC 


35 


100 


BIC 


0.986 


0.459 


0.585 


0.982 


0.519 


0.616 


0.954 


0.754 


0.703 








(0.013) 


(0.113) 


(0.055) 


(0.016) 


(0.114) 


(0.050) 


(0.029) 


(0.135) 


(0.051) 






cv 


0.657 


0.960 


0.432 


0.812 


0.905 


0.554 


0.865 


0.910 


0.627 








(0.050) 


(0.026) 


(0.036) 


(0.058) 


(0.056) 


(0.045) 


(0.028) 


(0.039) 


(0.039) 


75 


100 


BIC 


0.996 


0.382 


0.563 


0.995 


0.420 


0.579 


0.992 


0.486 


0.610 








(0.003) 


(0.065) 


(0.035) 


(0.003) 


(0.065) 


(0.032) 


(0.005) 


(0.091) 


(0.041) 






CV 


0.837 


0.887 


0.445 


0.895 


0.829 


0.503 


0.998 


0.362 


0.563 








(0.043) 


(0.031) 


(0.036) 


(0.024) 


(0.045) 


(0.027) 


(0.002) 


(0.058) 


(0.039) 


35 


10000 


BIC 


0.806 


1.000 


0.606 


1.000 


1.000 


1.000 


0.984 


1.000 


0.954 








(0.031) 


(0.000) 


(0.038) 


(0.000) 


(0.000) 


(0.000) 


(0.006) 


(0.000) 


(0.020) 






CV 


0.470 


1.000 


0.330 


0.997 


1.000 


0.991 


0.931 


1.000 


0.810 








(0.032) 


(0.000) 


(0.019) 


(0.007) 


(0.000) 


(0.023) 


(0.020) 


(0.000) 


(0.045) 



SCAD:the SCAD penalty; LASSO: the L x penalty; ADAP: the adaptive LASSO penalty 
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Table 4.3: Results for a sparse Graphical Model with 3 edges per node. Averages and standard errors from 100 runs 



p 


n 


Tuning 




LASSO 






SCAD 






ADAP 










SPEC 


SENS 


MCC 


SPEC 


SENS 


MCC 


SPEC 


SENS 


MCC 


35 


100 


BIC 


0.983 


0.460 


0.562 


0.992 


0.366 


0.538 


0.988 


0.458 


0.584 








(0.012) 


(0.073) 


(0.044) 


(0.015) 


(0.117) 


(0.052) 


(0.011) 


(0.092) 


(0.053) 






cv 


0.988 


0.363 


0.546 


0.998 


0.354 


0.558 


0.995 


0.423 


0.591 








(0.067) 


(0.095) 


(0.049) 


(0.003) 


(0.059) 


(0.042) 


(0.003) 


(0.060) 


(0.051) 


75 


100 


BIC 


0.992 


0.416 


0.539 


0.997 


0.339 


0.534 


0.995 


0.389 


0.541 








(0.003) 


(0.040) 


(0.032) 


(0.006) 


(0.072) 


(0.026) 


(0.003) 


(0.045) 


(0.030) 






CV 


0.998 


0.353 


0.555 


0.998 


0.353 


0.549 


1.000 


0.303 


0.536 








(0.001) 


(0.030) 


(0.027) 


(0.001) 


(0.030) 


(0.023) 


(0.000) 


(0.024) 


(0.019) 


35 


10000 


BIC 


0.932 


1.000 


0.769 


0.999 


1.000 


0.996 


0.992 


1.000 


0.966 








(0.017) 


(0.000) 


(0.044) 


(0.002) 


(0.000) 


(0.009) 


(0.004) 


(0.000) 


(0.019) 






CV 


0.657 


1.000 


0.407 


0.997 


1.000 


0.991 


0.959 


1.000 


0.851 








(0.045) 


(0.000) 


(0.034) 


(0.003) 


(0.000) 


(0.017) 


(0.029) 


(0.000) 


(0.081) 



SCAD:the SCAD penalty; LASSO: the L x penalty; ADAP: the adaptive LASSO penalty 
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